Input factory #2168

natoverse · 2026-01-10T00:43:04Z

Pulls input document processing into its own package
Revamps the factory to match our factory pattern
Adds jsonl and markitdown as input processors
Separates the storage config into its own block
Cleans up metadata handling to be entirely a chunking concern unrelated to document ingest
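
For context on the jsonl processor, here is a minimal sketch of line-by-line JSON Lines parsing using only the standard library (no pandas). The function name `read_jsonl` and the choice to raise `ValueError` on empty input are illustrative assumptions, not the actual reader in this PR.

```python
import json
from typing import Any


def read_jsonl(text: str) -> list[dict[str, Any]]:
    """Parse JSON Lines text into one dict per non-blank line.

    Hypothetical helper for illustration; the reader in graphrag-input
    may be structured differently.
    """
    rows: list[dict[str, Any]] = []
    for line in text.splitlines():
        if not line.strip():
            # Skip blank lines between records.
            continue
        rows.append(json.loads(line))
    if not rows:
        # Assumption: empty input is treated as an error.
        raise ValueError("No documents found in input")
    return rows


rows = read_jsonl('{"id": 1, "text": "first"}\n\n{"id": 2, "text": "second"}\n')
```

Each record stays a plain dictionary, so the raw row can be kept alongside the extracted text if needed.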

- Create new graphrag-input package with input loading utilities
- Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text)
- Add get_property utility for nested dictionary access with dot notation
- Include hashing utility for document ID generation
- Update all imports throughout codebase to use graphrag_input
- Add package to workspace configuration and release tasks
- Remove old graphrag.index.input module
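
As a rough illustration of the dot-notation lookup described above, the sketch below walks a nested dictionary one key at a time. The signature and the `KeyError` behavior are assumptions; the real `get_property` in graphrag_input may differ.

```python
from typing import Any


def get_property(data: dict[str, Any], path: str) -> Any:
    """Return the value at a dot-separated path inside a nested dict.

    Illustrative sketch only; signature and error handling are assumed.
    """
    current: Any = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"Property '{path}' not found")
        current = current[key]
    return current


record = {"title": "Report", "metadata": {"author": {"name": "Ada"}}}
author = get_property(record, "metadata.author.name")
```

A helper like this would let a structured reader accept column settings such as `metadata.author.name` without special-casing nested JSON.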

- Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk
- Add 'original' field to TextChunk to track pre-transform text
- Add optional transform callback to chunker.chunk() method
- Add add_metadata transformer for prepending metadata to chunks
- Update create_chunk_results to apply transforms and populate original
- Update sentence_chunker and token_chunker with transform support
- Refactor create_base_text_units to use new transformer pattern
- Rename pluck_metadata to get/collect methods on TextDocument
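
The transformer pattern listed above might look roughly like the sketch below. It uses a simple fixed-size character chunker as a stand-in for the sentence and token chunkers, and every signature here (including the `TextChunk` fields beyond `original`, and the `add_metadata` parameters) is an assumption for illustration rather than the code in this PR.

```python
from collections.abc import Callable
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextChunk:
    text: str      # chunk text after any transform is applied
    original: str  # chunk text before the transform
    index: int     # position of the chunk within the document


def add_metadata(metadata: dict[str, str]) -> Callable[[str], str]:
    """Build a transform that prepends 'key: value' lines to a chunk."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())

    def transform(text: str) -> str:
        return f"{header}\n{text}"

    return transform


def chunk(
    text: str,
    size: int,
    transform: Optional[Callable[[str], str]] = None,
) -> list[TextChunk]:
    """Split text into fixed-size pieces, optionally transforming each one."""
    chunks: list[TextChunk] = []
    for index, start in enumerate(range(0, len(text), size)):
        original = text[start : start + size]
        transformed = transform(original) if transform else original
        chunks.append(TextChunk(text=transformed, original=original, index=index))
    return chunks


chunks = chunk("abcdef", size=3, transform=add_metadata({"title": "Doc"}))
```

Keeping `original` next to the transformed text means metadata can be prepended for downstream use without losing the untouched chunk, which fits the goal of making metadata purely a chunking concern.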

natoverse added 27 commits January 5, 2026 16:35

Update input factory to match other factories

99aea52

Move input config alongside input readers

efaaa1f

Move file pattern logic into InputReader

2b89384

Set encoding default

c73263d

Clean up optional column configs

b265612

Combine structured data extraction

f066080

Remove pandas from input loading

2b83d66

Throw if empty documents

a03df1b

Add json lines (jsonl) input support

8b45208

Store raw data

6ac0b58

Merge branch 'v3/main' into input-factory

8e3c717

Fix merge imports

fb9a924

Move metadata handling entirely to chunking

e2395e9

Nicer automatic title

36b7be7

Typo

9d161bd

Add get_property utility for nested dictionary access with dot notation

164c5e1

Update structured_file_reader to use get_property utility

868fde1

Back-compat comment

2f6d075

Align input config type name with other factory configs

a671aa4

Add MarkItDown support

6d5076a

Remove pattern default from MarkItDown reader

6fbf26c

Remove plugins flag (implicit disabled)

e19501d

Format

6fba8d0

Update verb tests

c974970

Separate storage from input config

7ce1030

natoverse requested a review from a team as a code owner January 10, 2026 00:43

natoverse added 2 commits January 9, 2026 16:55

Add empty objects for NaN raw_data

e170124

Fix smoke tests

ade3a6f

natoverse added 2 commits January 12, 2026 10:29

Fix BOM in csv smoke

89a5223

Format

ad76163

gaudyb approved these changes Jan 12, 2026


natoverse merged commit 710fdad into v3/main Jan 12, 2026
14 checks passed

natoverse deleted the input-factory branch January 12, 2026 20:48
